Draft: Use Megatron-Bridge recipes for megatron_cfg. #1894
base: main
Conversation
📝 Walkthrough
This PR introduces Megatron-Bridge recipe support by replacing hardcoded megatron_cfg settings with recipe-based configuration loaded from a fully-qualified megatron_recipe path.
Changes
Sequence Diagram
sequenceDiagram
participant Config as Config Loader
participant Recipe as Recipe Module
participant Setup as Megatron Setup
participant Validation as Validation
participant ConfigContainer as Config Container
Config->>Setup: Provide policy config with megatron_recipe
Setup->>Recipe: Call load_recipe(recipe_path)
Recipe->>Recipe: Parse fully-qualified path
Recipe->>Recipe: Dynamic import module
Recipe->>Recipe: Resolve function name
Recipe->>ConfigContainer: Invoke recipe function
ConfigContainer-->>Recipe: Return ConfigContainer
Recipe-->>Setup: Return loaded megatron_cfg
Setup->>Setup: Merge checkpoint/policy configs into megatron_cfg
Setup->>Setup: Update megatron_cfg sub-configs (optimizer, scheduler, ddp, tokenizer)
Setup->>Validation: Validate optimizer_config(megatron_cfg)
Setup->>Validation: Validate training_config(megatron_cfg)
Validation-->>Setup: Validation complete
Setup-->>Config: Return finalized megatron_cfg
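As a rough illustration of the flow above, a recipe loader could look like the sketch below (hypothetical code; the actual load_recipe in nemo_rl/models/megatron/recipe_config.py may differ in naming and error handling):

import importlib
from typing import Any


def load_recipe(recipe_path: str) -> Any:
    """Resolve a fully-qualified recipe path and call it to obtain a ConfigContainer.

    Example: "megatron.bridge.recipes.llama.llama3.llama31_70b_pretrain_config"
    """
    module_name, _, func_name = recipe_path.rpartition(".")
    if not module_name or not func_name:
        raise ValueError(f"Expected '<module>.<function>', got: {recipe_path!r}")
    module = importlib.import_module(module_name)  # dynamic import of the recipe module
    recipe_fn = getattr(module, func_name)  # resolve the recipe function by name
    return recipe_fn()  # the recipe function returns a populated ConfigContainer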
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~65 minutes
Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 3 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v3.yaml (2)
42-47: ⚠️ Potential issue | 🟡 Minor
Log paths reference .v2 but the recipe filename is .v3.
The log_dir and wandb.name still contain .v2, which will cause confusion when distinguishing runs from the actual v2 config.
Proposed fix
 logger:
-  log_dir: logs/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v2
+  log_dir: logs/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v3
   wandb_enabled: true
   tensorboard_enabled: true
   wandb:
     project: nemo-rl
-    name: grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v2
+    name: grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v3
19-32: ⚠️ Potential issue | 🟡 Minor
Update version references from v2 to v3.
The filename was bumped to .v3 but the log_dir (line 42) and wandb.name (line 47) still reference .v2:
log_dir: logs/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v2
wandb.name: grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v2
Change both to .v3 for consistency with the filename.
examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.yaml (1)
21-24: ⚠️ Potential issue | 🔴 Critical
Use the correct base config for megatron recipes.
The interpolations on lines 21–22 reference ${.megatron_cfg.tensor_model_parallel_size} and ${.megatron_cfg.context_parallel_size}, but the recipe inherits from ../../distillation_math.yaml, which only defines dtensor_cfg and lacks the megatron_cfg block needed for these interpolations to resolve. Change the defaults to ../../distillation_math_megatron.yaml, which provides the required megatron configuration with proper tensor and context parallel sizes.
🤖 Fix all issues with AI agents
In `@examples/configs/grpo_math_70B_megatron.yaml`:
- Around line 25-26: The config mixes Llama 3.3 model with a Llama 3.1 recipe:
update either the model string or the recipe to match versions—specifically
change the model from "meta-llama/Llama-3.3-70B-Instruct" to
"meta-llama/Llama-3.1-70B-Instruct" to match the existing megatron_recipe value
"llama31_70b_pretrain_config", or replace "megatron_recipe:
megatron.bridge.recipes.llama.llama3.llama31_70b_pretrain_config" with a Llama
3.3–compatible recipe if you have one; ensure consistency across similar configs
like grpo_math_8B_megatron.yaml and sft_openmathinstruct2_megatron.yaml.
In
`@examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.yaml`:
- Line 25: The policy's megatron_recipe is pointing to the 32B config
(megatron.bridge.recipes.qwen.qwen3.qwen3_32b_pretrain_config) while the policy
model name is Qwen/Qwen3-1.7B-Base; change megatron_recipe to the matching 1.7B
recipe (if one exists) or remove the megatron_recipe entry so the policy uses
the HuggingFace model directly (like the FSDP variant) to ensure the student
model architecture matches Qwen/Qwen3-1.7B-Base.
In
`@examples/configs/recipes/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n8g-megatrontp2.v1.yaml`:
- Line 11: The recipe filename advertises TP=2 but the config lacks that
override; update the policy section to explicitly set
tensor_model_parallel_size: 2 (ensuring it overrides the base default
tensor_parallel_size: 1) so the effective parallelism matches the filename, or
alternatively rename the file to remove the "megatrontp2" segment; locate the
megatron_recipe entry and the policy block in the YAML (look for keys named
policy, tensor_model_parallel_size, and tensor_parallel_size) and add the
explicit tensor_model_parallel_size: 2 entry or adjust the filename accordingly.
In `@nemo_rl/models/megatron/recipe_config.py`:
- Line 1: Update the copyright header year from 2025 to 2026 in the file (the
top-line comment in recipe_config.py); locate the existing copyright comment and
change the year token "2025" to "2026" so the file header reflects the current
year.
In `@nemo_rl/models/megatron/setup.py`:
- Around line 298-303: The FileNotFoundError message for pretrained_run_config
is truncated; update the exception text (in the block that checks
os.path.exists(pretrained_run_config) where pretrained_run_config and rank are
referenced) to complete the sentence — e.g., explain that the one-time HF->mcore
conversion on rank=0 may have written to a directory not mounted on this node
and instruct the user to verify that the directory is mounted and that rank 0
wrote the config (include the path and rank variables in the message for
clarity).
- Around line 493-496: The validators access megatron_cfg.optimizer without
checking for None; add a guard at the start of _validate_optimizer_config (and
similarly in _validate_dtype_config) to return early if megatron_cfg.optimizer
is None (or otherwise bail with a no-op) so you don't dereference None when
reading optimizer_cpu_offload, optimizer_offload_fraction,
use_precision_aware_optimizer, bf16 or fp16; reference the
megatron_cfg.optimizer object in those functions and ensure any downstream logic
expects a present optimizer only after this check.
In `@nemo_rl/models/policy/workers/megatron_policy_worker.py`:
- Around line 281-293: Remove the debug block that prints "HELLO", writes a
hardcoded YAML path, and calls sys.exit during initialization: delete the
unconditional print("HELLO") and the entire rank-0 conditional that calls
self.megatron_cfg.to_yaml(...), imports sys, prints debug/warning messages, and
invokes sys.exit(0). Ensure no hardcoded paths remain and do not perform any
early process termination in __init__; if you need to persist the config,
replace this with a safe, optional debug hook (e.g., gated by a proper config
flag) that runs non-destructively and is synchronized across ranks rather than
exiting the process.
🧹 Nitpick comments (4)
examples/configs/recipes/llm/grpo-dapomath17k-dsv3-megatron.yaml (1)
1-59: Filename doesn't follow the LLM recipe naming convention.
The current filename grpo-dapomath17k-dsv3-megatron.yaml is missing the <nodes>n<gpus>g component required by the naming pattern. Based on the cluster config (num_nodes: 32, gpus_per_node: 8), a compliant name would look something like: grpo-dsv3-32n8g-megatron-dapomath17k.yaml
This is pre-existing and not introduced by this PR, so feel free to defer. As per coding guidelines, "For LLM recipes, follow the naming pattern: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].(yaml|sh)".
examples/configs/grpo_math_qwen30ba3b_megatron.yaml (1)
29-30: Nit: trailing blank line after megatron_recipe.
Line 30 appears to be a changed line that's just whitespace. Consider removing the trailing blank line to keep the config tidy.
Proposed fix
  megatron_recipe: megatron.bridge.recipes.qwen.qwen3.qwen3_30b_pretrain_config
-
  megatron_cfg:
nemo_rl/models/megatron/setup.py (2)
592-604: _update_dataclass_fields silently ignores unknown keys — could mask config typos.
The hasattr check on line 603 means a misspelled key (e.g., learning_rate instead of lr) is silently dropped. Consider logging a warning for unrecognized keys to aid debugging.
♻️ Proposed improvement
 def _update_dataclass_fields(target: Any, updates: dict) -> None:
     for key, value in updates.items():
         if hasattr(target, key):
             setattr(target, key, value)
+        else:
+            warnings.warn(
+                f"Ignoring unknown config key '{key}' for {type(target).__name__}. "
+                "Check for typos in megatron_cfg overrides."
+            )
632-637: Hidden default value 1 for train_global_batch_size.
Line 635 uses config.get("train_global_batch_size", 1), which introduces a hidden default. Per coding guidelines, YAML should be the single source of truth for defaults, and required config values should be accessed directly. Even though the comment says this is "ignored by NeMo-RL", Megatron itself may use it for internal validation. Consider using config["train_global_batch_size"] to surface misconfiguration early. As per coding guidelines: "Access required config values directly (e.g., policy_cfg['precision']) and assume they are present; do not introduce hidden defaults in code."
📝 Proposed fix
- megatron_cfg.train.global_batch_size = config.get("train_global_batch_size", 1)  # ignored by NeMo-RL
+ megatron_cfg.train.global_batch_size = config["train_global_batch_size"]  # ignored by NeMo-RL
megatron_recipe: megatron.bridge.recipes.llama.llama3.llama31_70b_pretrain_config
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for any llama3.3 or llama33 references in the codebase
echo "=== Searching for llama3.3 or llama33 references ==="
rg -i 'llama3[._]?3' --type yaml --type py -g '!*.pyc' 2>/dev/null | head -50
echo ""
echo "=== Searching for all llama pretrain_config references ==="
rg 'llama.*pretrain_config|llama.*config' --type py --type yaml 2>/dev/null | head -50
echo ""
echo "=== Searching for megatron.bridge.recipes.llama directory/references ==="
find . -path '*/megatron*' -name '*llama*' 2>/dev/null | grep -E '\.(py|yaml)$' | head -30
Repository: NVIDIA-NeMo/RL
Length of output: 3275
🏁 Script executed:
#!/bin/bash
# Read the full file to see if there are comments explaining the recipe choice
echo "=== Full content of grpo_math_70B_megatron.yaml ==="
cat examples/configs/grpo_math_70B_megatron.yaml
echo ""
echo "=== Search for TODOs or FIXMEs related to llama or recipe ==="
rg -n 'TODO|FIXME' examples/configs/grpo_math_70B_megatron.yaml
echo ""
echo "=== Check megatron.bridge documentation or references ==="
rg -n 'megatron.*bridge.*llama' --type py --type md 2>/dev/null | head -20
echo ""
echo "=== Search for Llama 3.3 in model name references ==="
rg -n 'Llama-3.3|Llama 3.3' --type yaml 2>/dev/null
Repository: NVIDIA-NeMo/RL
Length of output: 2726
Update model to Llama 3.1 or verify recipe compatibility with Llama 3.3.
Line 9 specifies meta-llama/Llama-3.3-70B-Instruct, but line 25 references llama31_70b_pretrain_config (Llama 3.1). No llama33_70b_pretrain_config exists in the codebase. Either change the model to meta-llama/Llama-3.1-70B-Instruct to match the recipe, or provide an alternative Llama 3.3–compatible recipe reference. All other similar configurations (e.g., grpo_math_8B_megatron.yaml, sft_openmathinstruct2_megatron.yaml) use matching model and recipe versions.
🤖 Prompt for AI Agents
In `@examples/configs/grpo_math_70B_megatron.yaml` around lines 25 - 26, The
config mixes Llama 3.3 model with a Llama 3.1 recipe: update either the model
string or the recipe to match versions—specifically change the model from
"meta-llama/Llama-3.3-70B-Instruct" to "meta-llama/Llama-3.1-70B-Instruct" to
match the existing megatron_recipe value "llama31_70b_pretrain_config", or
replace "megatron_recipe:
megatron.bridge.recipes.llama.llama3.llama31_70b_pretrain_config" with a Llama
3.3–compatible recipe if you have one; ensure consistency across similar configs
like grpo_math_8B_megatron.yaml and sft_openmathinstruct2_megatron.yaml.
${.megatron_cfg.context_parallel_size}}, 2}
megatron_cfg:
  enabled: true
  megatron_recipe: megatron.bridge.recipes.qwen.qwen3.qwen3_32b_pretrain_config
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Find all Qwen3 related recipe files and configs
echo "=== Searching for Qwen3 recipe definitions ==="
find . -type f \( -name "*.py" -o -name "*.yaml" \) | xargs grep -l "qwen3.*recipe\|qwen3.*config" | head -20
echo -e "\n=== Looking for available Qwen3 megatron recipes ==="
rg "qwen3.*pretrain_config|def.*qwen3.*config" --type py -A 2 | head -50
echo -e "\n=== Checking for 1.7b Qwen3 recipe variants ==="
rg "qwen3.*1[._]7" --type py --type yaml
echo -e "\n=== Checking the specific recipe module path ==="
find . -path "*megatron/bridge/recipes/qwen/qwen3*" -type f | head -10
Repository: NVIDIA-NeMo/RL
Length of output: 3262
🏁 Script executed:
#!/bin/bash
# Check the file in question and nearby distillation configs
echo "=== Current file content ==="
head -40 examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.yaml
echo -e "\n=== Check other distillation YAML files for megatron_recipe usage ==="
find examples/configs/recipes/llm -name "*distillation*.yaml" -type f | xargs grep -l "megatron_recipe"
echo -e "\n=== Show megatron_recipe usage in distillation configs ==="
find examples/configs/recipes/llm -name "*distillation*.yaml" -type f -exec grep -H -B 2 -A 2 "megatron_recipe" {} \;
Repository: NVIDIA-NeMo/RL
Length of output: 2029
🏁 Script executed:
#!/bin/bash
# Search for how megatron_recipe is used in the codebase (in Python code)
echo "=== How megatron_recipe is referenced in code ==="
rg "megatron_recipe" --type py -B 3 -A 3 | head -100
echo -e "\n=== Check if there's a mapping or factory for Qwen3 recipes ==="
rg "qwen3_.*pretrain_config|qwen3_.*model_config" --type py
Repository: NVIDIA-NeMo/RL
Length of output: 2617
🏁 Script executed:
#!/bin/bash
# Search for megatron.bridge recipe implementations
echo "=== Searching for megatron bridge recipes ==="
find . -path "*megatron*bridge*recipe*" -o -path "*megatron*recipe*" | grep -E "(\.py|\.yaml)$" | head -20
echo -e "\n=== Check if there's a separate install for megatron-bridge ==="
ls -la requirements*.txt pyproject.toml setup.py 2>/dev/null | head -10
echo -e "\n=== Search for any import of megatron.bridge ==="
rg "from megatron|import megatron" --type py -B 1 -A 1 | head -40
echo -e "\n=== Check the distillation_math.yaml base config ==="
cat examples/configs/distillation_math.yaml | head -100
Repository: NVIDIA-NeMo/RL
Length of output: 6630
🏁 Script executed:
#!/bin/bash
# Check if there are test files or other distillation configs that might hint at expected models
echo "=== Check other distillation YAML configs to understand model/recipe patterns ==="
find examples/configs/recipes/llm -name "*distillation*.yaml" -type f -exec basename {} \; | sort
echo -e "\n=== Look at the FSDP variant for comparison ==="
head -40 examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-fsdp2tp1.v1.yaml
Repository: NVIDIA-NeMo/RL
Length of output: 1702
🏁 Script executed:
#!/bin/bash
# Check if the policy model_name is specified elsewhere or if it's inferred
echo "=== Search for how policy model_name is determined ==="
rg "model_name|student|policy" nemo_rl/models/ --type py -B 2 -A 2 | grep -A 5 -B 5 "1.7\|1_7\|student" | head -60
echo -e "\n=== Check if megatron_recipe can override model size or just provide base config ==="
rg "megatron_recipe|load_recipe" nemo_rl/models/ --type py -B 3 -A 3 | head -80
Repository: NVIDIA-NeMo/RL
Length of output: 4475
🏁 Script executed:
#!/bin/bash
# Search for any Qwen3 recipe definitions in megatron-bridge or available recipes
echo "=== Search for all available Qwen3 recipes ==="
rg "qwen3.*pretrain_config|def.*qwen3_.*_config" -g "*.py"
echo -e "\n=== Check pyproject.toml for megatron-bridge dependency and version ==="
grep -A 10 -B 2 "megatron" pyproject.toml
echo -e "\n=== Check if there's a mapping of model sizes to recipes ==="
rg "1.7|1_7|1\.7|4b|8b|32b" nemo_rl/models/ --type py | grep -i recipe
Repository: NVIDIA-NeMo/RL
Length of output: 2883
🏁 Script executed:
#!/bin/bash
# Check the git history or comments around this change
echo "=== Check if there are any recent changes or TODOs about this recipe ==="
rg "TODO|FIXME|XXX|qwen3_32b_pretrain_config" nemo_rl/ examples/ --type py --type yaml -B 2 -A 2
echo -e "\n=== Compare the megatron recipe variant with FSDP variant more carefully ==="
echo "=== FSDP (without megatron_recipe) ==="
grep -A 30 "^policy:" examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-fsdp2tp1.v1.yaml | head -40
echo -e "\n=== Megatron (with megatron_recipe) ==="
grep -A 30 "^policy:" examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.yaml | head -40
Repository: NVIDIA-NeMo/RL
Length of output: 24191
Policy uses 32B megatron_recipe but should use 1.7B recipe for the student model.
The policy (student model) is configured with megatron_recipe: megatron.bridge.recipes.qwen.qwen3.qwen3_32b_pretrain_config, which provides the 32B model architecture. However, the policy model name inherited from base config is Qwen/Qwen3-1.7B-Base (1.7B). This mismatch will cause the student to be loaded with the wrong model architecture. Replace with a 1.7B recipe, or if no 1.7B recipe exists, remove the recipe and use the HuggingFace model directly like the FSDP variant does.
🤖 Prompt for AI Agents
In
`@examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.yaml`
at line 25, The policy's megatron_recipe is pointing to the 32B config
(megatron.bridge.recipes.qwen.qwen3.qwen3_32b_pretrain_config) while the policy
model name is Qwen/Qwen3-1.7B-Base; change megatron_recipe to the matching 1.7B
recipe (if one exists) or remove the megatron_recipe entry so the policy uses
the HuggingFace model directly (like the FSDP variant) to ensure the student
model architecture matches Qwen/Qwen3-1.7B-Base.
lr_warmup_init: 5.0e-08
distributed_data_parallel_config:
  overlap_grad_reduce: false
megatron_recipe: megatron.bridge.recipes.qwen.qwen2.qwen2_5_vl_3b_pretrain_config
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check if the megatron bridge recipe module exists and inspect its TP default
echo "=== Searching for qwen2_5_vl_3b_pretrain_config ==="
rg -rn 'qwen2_5_vl_3b_pretrain_config' --type py -C5
echo ""
echo "=== Searching for tensor_model_parallel_size in related recipe files ==="
fd 'qwen2' --extension py --exec grep -Hn 'tensor_model_parallel' {} \;
echo ""
echo "=== Check base defaults file for TP setting ==="
fd 'vlm_grpo_3B.yaml' --exec cat {} \;
Repository: NVIDIA-NeMo/RL
Length of output: 7552
🏁 Script executed:
# Also check the actual YAML file being reviewed
echo "=== Content of the reviewed YAML file ==="
cat -n examples/configs/recipes/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n8g-megatrontp2.v1.yaml
Repository: NVIDIA-NeMo/RL
Length of output: 589
Add explicit tensor_model_parallel_size: 2 override or correct the filename.
The filename advertises megatrontp2 but the verifiable configuration shows no TP=2 setting. The base defaults file specifies tensor_parallel_size: 1, and the reviewed file contains no override for it. Either add an explicit tensor_model_parallel_size: 2 under the policy section to match the filename, or rename the file to reflect the actual TP configuration (e.g., remove or change the megatrontp segment).
Per coding guidelines, the strategy segment in the VLM recipe naming pattern must accurately reflect the parallelism configuration.
🤖 Prompt for AI Agents
In
`@examples/configs/recipes/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n8g-megatrontp2.v1.yaml`
at line 11, The recipe filename advertises TP=2 but the config lacks that
override; update the policy section to explicitly set
tensor_model_parallel_size: 2 (ensuring it overrides the base default
tensor_parallel_size: 1) so the effective parallelism matches the filename, or
alternatively rename the file to remove the "megatrontp2" segment; locate the
megatron_recipe entry and the policy block in the YAML (look for keys named
policy, tensor_model_parallel_size, and tensor_parallel_size) and add the
explicit tensor_model_parallel_size: 2 entry or adjust the filename accordingly.
if not os.path.exists(pretrained_run_config):
    raise FileNotFoundError(
        f"Pretrained run config not found at {pretrained_run_config} on rank={rank}. "
        "This usually means that the one-time HF->mcore conversion on rank=0 saved to a directory "
        "not being mounted on this node. Please check"
    )
Incomplete error message — sentence ends with "Please check".
The FileNotFoundError message is truncated mid-sentence.
📝 Proposed fix
raise FileNotFoundError(
f"Pretrained run config not found at {pretrained_run_config} on rank={rank}. "
"This usually means that the one-time HF->mcore conversion on rank=0 saved to a directory "
- "not being mounted on this node. Please check"
+ "not being mounted on this node. Please check that the shared filesystem is accessible."
)
🧰 Tools
🪛 Ruff (0.14.14)
[warning] 299-303: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In `@nemo_rl/models/megatron/setup.py` around lines 298 - 303, The
FileNotFoundError message for pretrained_run_config is truncated; update the
exception text (in the block that checks os.path.exists(pretrained_run_config)
where pretrained_run_config and rank are referenced) to complete the sentence —
e.g., explain that the one-time HF->mcore conversion on rank=0 may have written
to a directory not mounted on this node and instruct the user to verify that the
directory is mounted and that rank 0 wrote the config (include the path and rank
variables in the message for clarity).
def _validate_optimizer_config(megatron_cfg: ConfigContainer) -> None:
    """Validate optimizer configuration."""
    optimizer_cpu_offload = config["megatron_cfg"]["optimizer"]["optimizer_cpu_offload"]
    optimizer_offload_fraction = config["megatron_cfg"]["optimizer"][
        "optimizer_offload_fraction"
    ]
    optimizer_cpu_offload = megatron_cfg.optimizer.optimizer_cpu_offload
    optimizer_offload_fraction = megatron_cfg.optimizer.optimizer_offload_fraction
🧩 Analysis chain
🏁 Script executed:
# First, let's examine the file structure and the specific lines mentioned
wc -l nemo_rl/models/megatron/setup.py
Repository: NVIDIA-NeMo/RL
Length of output: 95
🏁 Script executed:
# Look at lines 493-496 and surrounding context
sed -n '490,510p' nemo_rl/models/megatron/setup.py
Repository: NVIDIA-NeMo/RL
Length of output: 1016
🏁 Script executed:
# Look at lines 640-645 which allegedly set optimizer only when overrides exist
sed -n '635,650p' nemo_rl/models/megatron/setup.py
Repository: NVIDIA-NeMo/RL
Length of output: 861
🏁 Script executed:
# Search for where _validate_optimizer_config is called
rg -n '_validate_optimizer_config' nemo_rl/models/megatron/setup.py
Repository: NVIDIA-NeMo/RL
Length of output: 181
🏁 Script executed:
# Look at context around line 354 where _validate_optimizer_config is called
sed -n '345,365p' nemo_rl/models/megatron/setup.py
Repository: NVIDIA-NeMo/RL
Length of output: 760
🏁 Script executed:
# Look for where megatron_cfg is initialized before line 354
sed -n '320,360p' nemo_rl/models/megatron/setup.py
Repository: NVIDIA-NeMo/RL
Length of output: 1341
🏁 Script executed:
# Search for any defaults or initialization of optimizer in the file
rg -n 'megatron_cfg.optimizer\s*=' nemo_rl/models/megatron/setup.py
Repository: NVIDIA-NeMo/RL
Length of output: 137
🏁 Script executed:
# Let's check what _validate_dtype_config does with the optimizer parameter
sed -n '470,495p' nemo_rl/models/megatron/setup.py
Repository: NVIDIA-NeMo/RL
Length of output: 1142
🏁 Script executed:
# Check for _validate_dtype_config definition
rg -n 'def _validate_dtype_config' nemo_rl/models/megatron/setup.py -A 10
Repository: NVIDIA-NeMo/RL
Length of output: 688
🏁 Script executed:
# Check if optimizer is loaded from base config or recipes
rg -n 'OptimizerConfig\|optimizer.*=' nemo_rl/models/megatron/setup.py | head -20
Repository: NVIDIA-NeMo/RL
Length of output: 40
🏁 Script executed:
# Let's see the full _validate_dtype_config function to see how it handles optimizer_cfg
sed -n '559,580p' nemo_rl/models/megatron/setup.py
Repository: NVIDIA-NeMo/RL
Length of output: 1193
🏁 Script executed:
# Check if there's a None guard in _validate_dtype_config
rg -n 'if optimizer_cfg' nemo_rl/models/megatron/setup.py -A 2 -B 2
Repository: NVIDIA-NeMo/RL
Length of output: 40
🏁 Script executed:
# Check the complete _update_megatron_config function to see all optimizer initialization paths
rg -n 'def _update_megatron_config' nemo_rl/models/megatron/setup.py -A 50 | head -60
Repository: NVIDIA-NeMo/RL
Length of output: 2431
🏁 Script executed:
# Check what recipes are available and if they include optimizer config
find . -name "*.yaml" -path "*example*" -o -name "*.yaml" -path "*recipe*" | grep -i recipe | head -20
Repository: NVIDIA-NeMo/RL
Length of output: 1758
🏁 Script executed:
# Look for recipe definitions
fd -e yaml . --path "*recipe*" | head -20
Repository: NVIDIA-NeMo/RL
Length of output: 291
🏁 Script executed:
# Check example configs to see if optimizer is typically included
fd -e yaml . examples/configs/ 2>/dev/null | head -10
Repository: NVIDIA-NeMo/RL
Length of output: 429
🏁 Script executed:
# Check a few recipes to see if they define optimizer config
cat examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n4g-fsdp2tp1-quick.v2.yaml | grep -A 20 "megatron_cfg:"
Repository: NVIDIA-NeMo/RL
Length of output: 40
🏁 Script executed:
# Check another megatron-based recipe
cat examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.yaml | grep -A 30 "megatron_cfg:" | head -40
Repository: NVIDIA-NeMo/RL
Length of output: 312
🏁 Script executed:
# Check a base example config
cat examples/configs/dpo.yaml | grep -A 50 "megatron_cfg:"
Repository: NVIDIA-NeMo/RL
Length of output: 40
🏁 Script executed:
# Let's search for optimizer config in recipe files
rg -l "optimizer:" examples/configs/recipes/llm/ | head -5
Repository: NVIDIA-NeMo/RL
Length of output: 429
🏁 Script executed:
# Check if any recipe has optimizer in megatron_cfg
rg "megatron_cfg:" examples/configs/recipes/llm/ -A 30 | grep -B5 -A5 "optimizer" | head -30
Repository: NVIDIA-NeMo/RL
Length of output: 2533
🏁 Script executed:
# Look for a more complete megatron recipe file
wc -l examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.yaml
head -100 examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.yaml
Repository: NVIDIA-NeMo/RL
Length of output: 976
🏁 Script executed:
# Let's check if the recipe loading from megatron.bridge initializes optimizer
# First, let's see what megatron_recipe parameter does
rg -n 'megatron_recipe' nemo_rl/models/megatron/setup.py -B2 -A2
Repository: NVIDIA-NeMo/RL
Length of output: 542
🏁 Script executed:
# Check how megatron_cfg is loaded from the recipe
rg -n 'import_module_from_config\|load.*recipe\|get_megatron_config' nemo_rl/models/megatron/setup.py -B2 -A2
Repository: NVIDIA-NeMo/RL
Length of output: 40
🏁 Script executed:
# Search for where megatron_cfg is initially created/loaded before _update_megatron_config
sed -n '280,320p' nemo_rl/models/megatron/setup.py
Repository: NVIDIA-NeMo/RL
Length of output: 1883
🏁 Script executed:
# Let's check what load_recipe does and where it comes from
rg -n 'def load_recipe\|from.*load_recipe\|import.*load_recipe' nemo_rl/models/megatron/setup.py
Repository: NVIDIA-NeMo/RL
Length of output: 40
🏁 Script executed:
# Check imports at the top of the file
head -50 nemo_rl/models/megatron/setup.py
Repository: NVIDIA-NeMo/RL
Length of output: 1672
🏁 Script executed:
# Let's check if there's a test or example that shows what happens when optimizer is not provided
rg -n 'optimizer.*None\|None.*optimizer' nemo_rl/models/megatron/setup.py -B2 -A2 | head -40
Repository: NVIDIA-NeMo/RL
Length of output: 40
🏁 Script executed:
# Search for load_recipe import
rg -n 'load_recipe' nemo_rl/models/megatron/setup.py -B5 | head -20
Repository: NVIDIA-NeMo/RL
Length of output: 645
🏁 Script executed:
# Look at the complete imports section
sed -n '1,80p' nemo_rl/models/megatron/setup.py | tail -40
Repository: NVIDIA-NeMo/RL
Length of output: 1595
🏁 Script executed:
# Let's look at what _validate_dtype_config actually does with a None optimizer
sed -n '559,600p' nemo_rl/models/megatron/setup.py
Repository: NVIDIA-NeMo/RL
Length of output: 2247
🏁 Script executed:
# Let's check the order of calls and see all validation functions
sed -n '350,365p' nemo_rl/models/megatron/setup.py
Repository: NVIDIA-NeMo/RL
Length of output: 523
🏁 Script executed:
# Check if _validate_dtype_config can handle None optimizer_cfg
rg -n '_validate_dtype_config\|optimizer_cfg' nemo_rl/models/megatron/setup.py | grep -A2 -B2 "354\|355\|559"
Repository: NVIDIA-NeMo/RL
Length of output: 40
🏁 Script executed:
# Look for where optimizer might be None and see if there's any guard
rg -n 'optimizer.*None' nemo_rl/models/megatron/setup.py
Repository: NVIDIA-NeMo/RL
Length of output: 255
🏁 Script executed:
# Check if all recipes in examples actually include optimizer config
rg "megatron_cfg:" examples/configs/recipes/llm/*.yaml -A 20 | grep -c "optimizer:"
Repository: NVIDIA-NeMo/RL
Length of output: 60
🏁 Script executed:
# Let's see line 662 and surrounding context to understand another place that checks optimizer is not None
sed -n '660,680p' nemo_rl/models/megatron/setup.py
Repository: NVIDIA-NeMo/RL
Length of output: 906
🏁 Script executed:
# Let's check what happens at line 813
sed -n '810,830p' nemo_rl/models/megatron/setup.py
Repository: NVIDIA-NeMo/RL
Length of output: 892
🏁 Script executed:
# Let's trace through - what does load_recipe return? Check the recipe_config module
cat nemo_rl/models/megatron/recipe_config.py | head -50
Repository: NVIDIA-NeMo/RL
Length of output: 1816
🏁 Script executed:
# Let's check what happens when a recipe doesn't have optimizer - is that possible?
# Look at what load_recipe returns and check recipe examples
rg -n "def load_recipe" nemo_rl/models/megatron/recipe_config.py -A 30
Repository: NVIDIA-NeMo/RL
Length of output: 1315
🏁 Script executed:
# Check if recipes without optimizer in megatron_cfg are possible
# Let's look at a simple FSDP recipe that doesn't have megatron_cfg.optimizer
cat examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n4g-fsdp2tp1-quick.v2.yaml
Repository: NVIDIA-NeMo/RL
Length of output: 274
🏁 Script executed:
# Let's check what happens when we have a recipe without optimizer and see if there's actual flow
# Look for tests that might show this scenario
find . -name "*test*.py" -path "*/megatron/*" | head -5
Repository: NVIDIA-NeMo/RL
Length of output: 209
🏁 Script executed:
# Let's check the test file for megatron setup
cat tests/unit/models/megatron/test_megatron_setup.py
Repository: NVIDIA-NeMo/RL
Length of output: 39612
🏁 Script executed:
# Let's also check what the actual parent yaml for the FSDP recipe contains
cat examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp2-quick.v2.yaml | grep -A 30 "megatron_cfg:"
Repository: NVIDIA-NeMo/RL
Length of output: 40
🏁 Script executed:
# Check if there are non-megatron recipes that may have None optimizer
rg "megatron_recipe:" examples/configs/ -l
Repository: NVIDIA-NeMo/RL
Length of output: 1888
🏁 Script executed:
# Let's look more closely at the test - specifically how it's calling the function
sed -n '1010,1030p' tests/unit/models/megatron/test_megatron_setup.py
Repository: NVIDIA-NeMo/RL
Length of output: 827
🏁 Script executed:
# Check if there are any recipes without megatron_recipe that use megatron_cfg
rg "megatron_cfg:" examples/configs/ -l | while read file; do
if ! grep -q "megatron_recipe:" "$file"; then
echo "=== $file ==="
grep -A 10 "megatron_cfg:" "$file" | head -15
fi
done
Repository: NVIDIA-NeMo/RL
Length of output: 4802
🏁 Script executed:
# Now let's trace the actual execution path - what happens at line 354?
# Check if there's guaranteed initialization of optimizer before _validate_optimizer_config
sed -n '280,365p' nemo_rl/models/megatron/setup.py | grep -A5 -B5 "optimizer\|_validate"
Repository: NVIDIA-NeMo/RL
Length of output: 920
🏁 Script executed:
# Let's check what happens when a recipe like grpo-deepseek-v3-32n8g doesn't have optimizer
# Look for how recipes set optimizer and whether it defaults to something
sed -n '287,295p' nemo_rl/models/megatron/setup.py
Repository: NVIDIA-NeMo/RL
Length of output: 455
🏁 Script executed:
# Let's trace the actual execution - what does load_recipe return?
# Check if recipes without optimizer defined leave it as None
rg "megatron.bridge.recipes" examples/configs/ | head -3
Repository: NVIDIA-NeMo/RL
Length of output: 452
🏁 Script executed:
# Let's check if there's a default recipe function to understand what load_recipe returns
python3 << 'EOF'
# Simulate what happens when loading a recipe
# The recipe functions are imported from megatron.bridge
# If a recipe function is called and doesn't set optimizer, what happens?
# We need to understand ConfigContainer behavior
import sys
sys.path.insert(0, '/tmp')
# We can't import megatron.bridge in this environment, but we can reason about it
# based on the code structure shown
# From the code, load_recipe calls a recipe function that returns a ConfigContainer
# If the recipe doesn't set optimizer, megatron_cfg.optimizer would be whatever the
# default is in ConfigContainer - likely None
print("When a recipe is loaded:")
print("1. load_recipe() calls the recipe function")
print("2. Recipe function returns a ConfigContainer")
print("3. If recipe doesn't initialize optimizer, it would be None by default")
print("")
print("Then in _update_megatron_config:")
print("1. If no optimizer overrides in config, the None value is NOT modified")
print("2. So megatron_cfg.optimizer remains None")
print("")
print("Then at line 352: _validate_dtype_config accesses optimizer_cfg.use_precision_aware_optimizer")
print("Then at line 354: _validate_optimizer_config accesses optimizer.optimizer_cpu_offload")
print("Both would crash if optimizer is None")
EOF
Repository: NVIDIA-NeMo/RL
Length of output: 607
🏁 Script executed:
# Check if _validate_dtype_config has any None guard
sed -n '559,590p' nemo_rl/models/megatron/setup.py | grep -i "none\|if.*optimizer"
Repository: NVIDIA-NeMo/RL
Length of output: 349
Add guard against None optimizer before accessing its attributes.
When a Megatron-Bridge recipe doesn't define an optimizer configuration, megatron_cfg.optimizer remains None. The code then crashes at lines 352 and 354 when attempting to access optimizer attributes without checking:
- Line 352: _validate_dtype_config() accesses optimizer_cfg.use_precision_aware_optimizer and optimizer_cfg.bf16 / optimizer_cfg.fp16
- Line 354: _validate_optimizer_config() accesses optimizer.optimizer_cpu_offload and optimizer.optimizer_offload_fraction
The issue occurs because _update_megatron_config() (lines 640–645) only modifies optimizer if config overrides are present; if the recipe lacks optimizer config and no overrides are specified, the None value remains unchanged.
🛡️ Proposed fixes
For _validate_optimizer_config:
def _validate_optimizer_config(megatron_cfg: ConfigContainer) -> None:
"""Validate optimizer configuration."""
+ if megatron_cfg.optimizer is None:
+ return
optimizer_cpu_offload = megatron_cfg.optimizer.optimizer_cpu_offload
    optimizer_offload_fraction = megatron_cfg.optimizer.optimizer_offload_fraction
For _validate_dtype_config, add the same guard at the start, or ensure optimizer is never passed as None.
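A minimal sketch of that guard for _validate_dtype_config, assuming it also receives the full megatron_cfg as the optimizer variant above does (the real signature in setup.py may differ):

def _validate_dtype_config(megatron_cfg) -> None:
    """Validate dtype-related settings (sketch)."""
    if megatron_cfg.optimizer is None:
        # The recipe defined no optimizer; skip the optimizer-related dtype checks.
        return
    optimizer_cfg = megatron_cfg.optimizer
    # ... existing checks on optimizer_cfg.use_precision_aware_optimizer and
    # optimizer_cfg.bf16 / optimizer_cfg.fp16 continue unchanged below ...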
🤖 Prompt for AI Agents
In `@nemo_rl/models/megatron/setup.py` around lines 493 - 496, The validators
access megatron_cfg.optimizer without checking for None; add a guard at the
start of _validate_optimizer_config (and similarly in _validate_dtype_config) to
return early if megatron_cfg.optimizer is None (or otherwise bail with a no-op)
so you don't dereference None when reading optimizer_cpu_offload,
optimizer_offload_fraction, use_precision_aware_optimizer, bf16 or fp16;
reference the megatron_cfg.optimizer object in those functions and ensure any
downstream logic expects a present optimizer only after this check.
print("HELLO")
# Dump ConfigContainer to YAML for inspection (only on rank 0)
if self.rank == 0:
    config_dump_path = "/lustre/fsw/portfolios/coreai/users/sfawzy/final_megatron_config_6.yaml"
    try:
        self.megatron_cfg.to_yaml(config_dump_path)
        print(f"[DEBUG] Saved final ConfigContainer to: {config_dump_path}")
    except Exception as e:
        print(f"[WARNING] Failed to save ConfigContainer to YAML: {e}")
    # Exit early after dumping config for inspection
    import sys
    print("[DEBUG] Exiting after ConfigContainer dump")
    sys.exit(0)
Remove debug code: sys.exit(0) terminates the worker during __init__, deadlocking all other ranks.
This block:
- Prints "HELLO" unconditionally on all ranks.
- On rank 0, writes to a hardcoded user-specific Lustre path, then calls sys.exit(0) — killing the process mid-initialization.
- All non-zero ranks will hang indefinitely at the next collective operation, causing a distributed deadlock.
- The late import sys inside the conditional is also non-idiomatic.
This is development/debug code that must be removed before merging. As per coding guidelines, debug artifacts should be removed before merging.
🗑️ Proposed fix: remove debug block
- print("HELLO")
- # Dump ConfigContainer to YAML for inspection (only on rank 0)
- if self.rank == 0:
- config_dump_path = "/lustre/fsw/portfolios/coreai/users/sfawzy/final_megatron_config_6.yaml"
- try:
- self.megatron_cfg.to_yaml(config_dump_path)
- print(f"[DEBUG] Saved final ConfigContainer to: {config_dump_path}")
- except Exception as e:
- print(f"[WARNING] Failed to save ConfigContainer to YAML: {e}")
- # Exit early after dumping config for inspection
- import sys
- print("[DEBUG] Exiting after ConfigContainer dump")
- sys.exit(0)
🧰 Tools
🪛 Ruff (0.14.14)
[warning] 288-288: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In `@nemo_rl/models/policy/workers/megatron_policy_worker.py` around lines 281 -
293, Remove the debug block that prints "HELLO", writes a hardcoded YAML path,
and calls sys.exit during initialization: delete the unconditional
print("HELLO") and the entire rank-0 conditional that calls
self.megatron_cfg.to_yaml(...), imports sys, prints debug/warning messages, and
invokes sys.exit(0). Ensure no hardcoded paths remain and do not perform any
early process termination in __init__; if you need to persist the config,
replace this with a safe, optional debug hook (e.g., gated by a proper config
flag) that runs non-destructively and is synchronized across ranks rather than
exiting the process.
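If persisting the resolved config is still useful, one possible shape for such a gated hook is sketched below (the dump_final_megatron_config_path key, the self.cfg attribute, and the barrier placement are assumptions for illustration, not something this PR defines; intended as a method on the worker class):

import os

import torch


def _maybe_dump_megatron_cfg(self) -> None:
    """Optionally persist the resolved ConfigContainer without disturbing training."""
    dump_path = self.cfg.get("dump_final_megatron_config_path")  # hypothetical config key
    if not dump_path:
        return
    if self.rank == 0:
        os.makedirs(os.path.dirname(dump_path) or ".", exist_ok=True)
        self.megatron_cfg.to_yaml(dump_path)  # same serialization call the debug block used
        print(f"Saved final ConfigContainer to: {dump_path}")
    # Keep all ranks in step instead of exiting mid-initialization.
    if torch.distributed.is_initialized():
        torch.distributed.barrier()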
What does this PR do?
Use Megatron-Bridge recipes for megatron_cfg.
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use this
Before your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
Release Notes
New Features
Chores